{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Credits\n",
    "TensorFlow translation of [Lasagne tutorial](https://github.com/DeepLearningDTU/nvidia_deep_learning_summercamp_2016/blob/master/lab1/lab1_FFN.ipynb). Thanks to [skaae](https://github.com/skaae), [casperkaae](https://github.com/casperkaae) and [larsmaaloee](https://github.com/larsmaaloee)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dependancies and supporting functions\n",
    "Loading dependancies and supporting functions by running the code block below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import sklearn.datasets\n",
    "import tensorflow as tf\n",
    "from tensorflow.python.framework.ops import reset_default_graph\n",
    "\n",
    "# Do not worry about the code below for now, it is used for plotting later\n",
    "def plot_decision_boundary(pred_func, X, y):\n",
    "    #from https://github.com/dennybritz/nn-from-scratch/blob/master/nn-from-scratch.ipynb\n",
    "    # Set min and max values and give it some padding\n",
    "    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5\n",
    "    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5\n",
    "    \n",
    "    h = 0.01\n",
    "    # Generate a grid of points with distance h between them\n",
    "    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n",
    "    \n",
    "    yy = yy.astype('float32')\n",
    "    xx = xx.astype('float32')\n",
    "    # Predict the function value for the whole gid\n",
    "    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])[:,0]\n",
    "    Z = Z.reshape(xx.shape)\n",
    "    # Plot the contour and training examples\n",
    "    plt.figure()\n",
    "    plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu)\n",
    "    plt.scatter(X[:, 0], X[:, 1], c=-y, cmap=plt.cm.Spectral)\n",
    "\n",
    "def onehot(t, num_classes):\n",
    "    out = np.zeros((t.shape[0], num_classes))\n",
    "    for row, col in enumerate(t):\n",
    "        out[row, col] = 1\n",
    "    return out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Neural networks 101\n",
    "In this notebook you will implement a simple neural network in TensorFlow utilizing the graph building and automatic differentiation engine of TensorFlow. We assume that you are already familiar with backpropagation (if not please see [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/) or [Michal Nielsen](http://neuralnetworksanddeeplearning.com/chap2.html).\n",
    "We'll not spend much time on how TensorFlow works, but you can refer to [this short tutorial](https://www.tensorflow.org/versions/r0.10/get_started/basic_usage.html) if you are interested, or [the python documentation](https://www.tensorflow.org/versions/r0.10/api_docs/index.html).\n",
    "\n",
    "(Additionally, for the ambitious people we have previously made an assignment where you will implement both the forward and backpropagation in a neural network by hand, https://github.com/DTU-deeplearning/day1-NN/blob/master/exercises_1.ipynb)(Ole, skal jeg også implementere det?)\n",
    "\n",
    "In this exercise we'll start right away by defining logistic regression model in TensorFlow. Some details of TensorFlow can be a bit confusing, however you'll pick them up when you worked with it for some time. We'll initially start with a simple 2-D and 2-class classification problem where the class decision boundary can be visualized. Initially we show that logistic regression can only separate classes linearly. Adding a Non-linear hidden layer to the algorithm permits nonlinear class separation. If time permits we'll continue on to implement a fully conencted neural network to classify the (in)famous MNIST dataset consisting of images of hand written digits."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Problem \n",
    "We'll initally demonstrate the that MLPs can classify non-linear problems whereas simple logistic regression cannot. For ease of visualization and computationl speed we initially experiment on the simple 2D half-moon dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Generate a dataset and plot it\n",
    "np.random.seed(0)\n",
    "num_samples = 300\n",
    "\n",
    "X, y = sklearn.datasets.make_moons(num_samples, noise=0.20)\n",
    "\n",
    "X_tr = X[:100].astype('float32')\n",
    "X_val = X[100:200].astype('float32')\n",
    "X_te = X[200:].astype('float32')\n",
    "\n",
    "y_tr = y[:100].astype('int32')\n",
    "y_val = y[100:200].astype('int32')\n",
    "y_te = y[200:].astype('int32')\n",
    "\n",
    "plt.scatter(X_tr[:,0], X_tr[:,1], s=40, c=y_tr, cmap=plt.cm.BuGn)\n",
    "\n",
    "print X.shape, y.shape\n",
    "\n",
    "num_features = X_tr.shape[-1]\n",
    "num_output = 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# From Logistic Regression to \"Deep Learning\" in Lasagne\n",
    "The code implements logistic regression in TensorFlow. In section __Assignments Half Moon__ you are asked to modify the code into a neural network.\n",
    "\n",
    "The building blocks of TensorFlow are variables and operations, with these we can form computational graphs that form neural networks.\n",
    "\n",
    "The [tf.placeholder](https://www.tensorflow.org/versions/r0.10/api_docs/python/io_ops.html#placeholder) allows us to feed our input data to the computational graph. We can define constraints with the shape of the placeholder to only take a tensor of a certain shape. Note that it is common to provide ``None`` for the first dimension, which allows us to vary the batch size at runtime.\n",
    "\n",
    "The [tf.Variable](https://www.tensorflow.org/versions/r0.10/api_docs/python/state_ops.html#Variable) allows us to store and update Tensors in our graph. Variables are used to build weights for our neural network. Note that we will use a wrapper called [`tf.get_variable`](https://www.tensorflow.org/versions/r0.10/api_docs/python/state_ops.html#get_variable) througout this tutorial.\n",
    "\n",
    "The [tf.Operation](https://www.tensorflow.org/versions/r0.10/api_docs/python/framework.html#Operation) allows us to perform operations on tensors, resulting in new tensors. Such as when computing the logistic regression which is implemented below:\n",
    "\n",
    "$$y = nonlinearity(xW + b)$$\n",
    "\n",
    "where $x$ is the input tensor, $y$ is the output tensor and $\\{W, b\\}$ are the weights (variable tensors). The weights are initialized with an initializer of our choice (check [tensorflow's API](https://www.tensorflow.org/versions/r0.10/api_docs/index.html) for more.\n",
    "x has shape ```[batchsize, num_features]```. ```W``` has shape ```[num_features, num_units]``` and b has ```[num_units]```. y has then ```[batch_size, num_units]```.\n",
    "\n",
    "NOTE: to make building neural networks easier, TensorFlow's [contrib](https://www.tensorflow.org/versions/r0.10/api_docs/python/contrib.layers.html#layers-contrib) wraps TensorFlow functionality to support various operations such as; [convolutions](https://www.tensorflow.org/versions/r0.10/api_docs/python/contrib.layers.html#convolution2d), [batch_norm](https://www.tensorflow.org/versions/r0.10/api_docs/python/contrib.layers.html#batch_norm), [fully_connected](https://www.tensorflow.org/versions/r0.10/api_docs/python/contrib.layers.html#fully_connected).\n",
    "\n",
    "In this first exercise we will use basic TensorFlow functions so that you can learn how to build it from scratch. This will help you later if you want to build your own custom operations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TensorFlow Playerground\n",
    "\n",
    "If you are new to Neural Networks, start by using the [TensorFlow playground](http://playground.tensorflow.org/) to familiarize yourself with hidden layers, hidden units, activations, learning rate, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# resets the graph, needed when initializing weights multiple times, like in this notebook\n",
    "reset_default_graph()\n",
    "\n",
    "# Setting up placeholder, this is where your data enters the graph!\n",
    "x_pl = tf.placeholder(tf.float32, [None, num_features])\n",
    "\n",
    "# Setting up variables, these variables are weights in your network that can be update while running our graph.\n",
    "# Notice, to make a hidden layer, the weights needs to have the following dimensionality\n",
    "# W[number_of_units_going_in, number_of_units_going_out]\n",
    "# b[number_of_units_going_out]\n",
    "# in the example below we have 2 input units (num_features) and 2 output units (num_output)\n",
    "# so our weights become W[2, 2], b[2]\n",
    "# if we want to make a hidden layer with 100 units, we need to define the shape of the\n",
    "# first weight to W[2, 100], b[2] and the shape of the second weight to W[100, 2], b[2]\n",
    "\n",
    "# defining our initializer for our weigths from a normal distribution (mean=0, std=0.1)\n",
    "weight_initializer = tf.truncated_normal_initializer(stddev=0.1)\n",
    "with tf.variable_scope('l_1'): # if you run it more than once, reuse has to be True\n",
    "    W_1 = tf.get_variable('W', [num_features, num_output], # change num_output to 100 for mlp\n",
    "                          initializer=weight_initializer)\n",
    "    b_1 = tf.get_variable('b', [num_output], # change num_output to 100 for mlp\n",
    "                          initializer=tf.constant_initializer(0.0))\n",
    "# with tf. variable_scope('l_2'):\n",
    "#     W_2 = tf.get_variable('W', [100, num_output],\n",
    "#                           initializer=weight_initializer)\n",
    "#     b_2 = tf.get_variable('b', [num_output],\n",
    "#                           initializer=tf.constant_initializer(0.0))\n",
    "\n",
    "# Setting up ops, these ops will define edges along our computational graph\n",
    "# The below ops will compute a logistic regression, but can be modified to compute\n",
    "# a neural network\n",
    "\n",
    "l_1 = tf.matmul(x_pl, W_1) + b_1\n",
    "# to make a hidden layer we need a nonlinearity\n",
    "# l_1_nonlinear = tf.nn.relu(l_1)\n",
    "# the layer before the softmax should not have a nonlinearity\n",
    "# l_2 = tf.matmul(l_1_nonlinear, W_2) + b_2\n",
    "y = tf.nn.softmax(l_1) # change to l_2 for MLP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# knowing how to print your tensors and ops is useful, here are some examples\n",
    "print(\"---placeholders---\")\n",
    "print(x_pl.name)\n",
    "print(x_pl)\n",
    "print\n",
    "print(\"---weights---\")\n",
    "print(W_1.name)\n",
    "print(W_1.get_shape())\n",
    "print(W_1)\n",
    "print\n",
    "print(b_1.name)\n",
    "print(b_1)\n",
    "print(b_1.get_shape())\n",
    "print\n",
    "print(\"---ops---\")\n",
    "print(l_1.name)\n",
    "print(l_1)\n",
    "print\n",
    "print(y.name)\n",
    "print(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we have built the network we have our tensors in our default [graph](https://www.tensorflow.org/versions/r0.10/api_docs/python/framework.html#Graph), which we can use to build the cost function and training part.\n",
    "\n",
    "Further, using our default graph we can print the operations and variables of our default graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# y_ is a placeholder variable taking on the value of the target batch.\n",
    "y_ = tf.placeholder(tf.float32, [None, num_output])\n",
    "\n",
    "# computing cross entropy per sample\n",
    "cross_entropy = -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1])\n",
    "\n",
    "# averaging over samples\n",
    "cross_entropy = tf.reduce_mean(cross_entropy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that our weights and operations defined in the `l_1` space are saved in the `l_1` directory of the graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# using the graph to print ops\n",
    "print(\"operations\")\n",
    "operations = [op.name for op in tf.get_default_graph().get_operations()]\n",
    "print(operations)\n",
    "print\n",
    "# variables are accessed through tensorflow\n",
    "print(\"variables\")\n",
    "variables = [var.name for var in tf.all_variables()]\n",
    "print(variables)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To train our neural network we need to update the parameters in direction of the negative gradient w.r.t the cost function we defined earlier.\n",
    "We can use `tf.train.Optimizer` to get the gradients (using `compute_gradients`) for all parameters in the network w.r.t ``cost_train``.\n",
    "Imagine that `cost_train` is a function and we want to go downhill. We go downhill by changing the value of the paramters in direction of the negative gradient. \n",
    "\n",
    "Finally we can use the built-in `minimize` to calculate the stochastic gradient descent (SGD) update rule for each paramter in the network.\n",
    "\n",
    "Heres a small animation of gradient descent: http://imgur.com/a/Hqolp . E.g why saddle points might be difficult.\n",
    "To use the other optimizers checkout which optimizers TensorFlow [supports](https://www.tensorflow.org/versions/r0.10/api_docs/python/train.html#optimizers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Defining our optimizer (try with different optimizers here!)\n",
    "optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)\n",
    "\n",
    "# Computing our gradients\n",
    "grads_and_vars = optimizer.compute_gradients(cross_entropy)\n",
    "\n",
    "# Applying the gradients\n",
    "train_op = optimizer.apply_gradients(grads_and_vars)\n",
    "\n",
    "# Notice, alternatively you can use train_op = optimizer.minimize(crossentropy)\n",
    "# instead of the three steps above"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we make the prediction functions, such that we can get an accuracy measure over a batch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# making a one-hot encoded vector of correct (1) and incorrect (0) predictions\n",
    "correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))\n",
    "\n",
    "# averaging the one-hot encoded vector\n",
    "accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step is to utilize our `train_op` function repeatedly in order to optimize our weights `W_1` and `b_1` to make the best possible linear seperation of the half moon dataset.\n",
    "\n",
    "It is worth or read a short introduction on TensorFlow [sessions](https://www.tensorflow.org/versions/r0.10/api_docs/python/client.html#Session) before continuing to the next codeblock. Sessions are used to run TensorFlow graphs, they uses `fetches` to decide which parts of the graph to compute and `feed_dicts` to load data into the graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# defining a function to make predictions using our classifier\n",
    "def pred(X_in, sess):\n",
    "    # first we must define what data to give it\n",
    "    feed_dict = {x_pl: X_in}\n",
    "    # secondly our fetches\n",
    "    fetches = [y]\n",
    "    # utilizing the given session (ref. sess) to compute results\n",
    "    res = sess.run(fetches, feed_dict)\n",
    "    # res is a list with each indices representing the corresponding element in fetches\n",
    "    return res[0]\n",
    "\n",
    "# Training loop\n",
    "num_epochs = 1000\n",
    "\n",
    "train_cost, val_cost, val_acc = [],[],[]\n",
    "# restricting memory usage, TensorFlow is greedy and will use all memory otherwise\n",
    "gpu_opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.2)\n",
    "with tf.Session(config=tf.ConfigProto(gpu_options=gpu_opts)) as sess:\n",
    "    \n",
    "    # initializing all variables\n",
    "    init = tf.initialize_all_variables()\n",
    "    sess.run(init)\n",
    "    plot_decision_boundary(lambda x: pred(x, sess), X_val, y_val)\n",
    "    plt.title(\"Untrained Classifier\")\n",
    "    for e in range(num_epochs):\n",
    "        ### TRAINING ###\n",
    "        # what to feed to our train_op\n",
    "        # notice we onehot encode our predictions to change shape from (batch,) -> (batch, num_output)\n",
    "        feed_dict_train = {x_pl: X_tr, y_: onehot(y_tr, num_output)}\n",
    "        \n",
    "        # deciding which parts to fetch, train_op makes the classifier \"train\"\n",
    "        fetches_train = [train_op, cross_entropy]\n",
    "        \n",
    "        # running the train_op\n",
    "        res = sess.run(fetches=fetches_train, feed_dict=feed_dict_train)\n",
    "        # storing cross entropy (second fetch argument, so index=1)\n",
    "        train_cost += [res[1]]\n",
    "    \n",
    "        ### VALIDATING ###\n",
    "        # what to feed our accuracy op\n",
    "        feed_dict_valid = {x_pl: X_val, y_: onehot(y_val, num_output)}\n",
    "\n",
    "        # deciding which parts to fetch\n",
    "        fetches_valid = [cross_entropy, accuracy]\n",
    "\n",
    "        # running the validation\n",
    "        res = sess.run(fetches=fetches_valid, feed_dict=feed_dict_valid)\n",
    "        val_cost += [res[0]]\n",
    "        val_acc += [res[1]]\n",
    "\n",
    "        if e % 100 == 0:\n",
    "            print \"Epoch %i, Train Cost: %0.3f\\tVal Cost: %0.3f\\t Val acc: %0.3f\"%(e, train_cost[-1],val_cost[-1],val_acc[-1])\n",
    "\n",
    "    ### TESTING ###\n",
    "    # what to feed our accuracy op\n",
    "    feed_dict_test = {x_pl: X_te, y_: onehot(y_te, num_output)}\n",
    "\n",
    "    # deciding which parts to fetch\n",
    "    fetches_test = [cross_entropy, accuracy]\n",
    "\n",
    "    # running the validation\n",
    "    res = sess.run(fetches=fetches_test, feed_dict=feed_dict_test)\n",
    "    test_cost = res[0]\n",
    "    test_acc = res[1]\n",
    "    print \"\\nTest Cost: %0.3f\\tTest Accuracy: %0.3f\"%(test_cost, test_acc)\n",
    "    \n",
    "    # For plotting purposes\n",
    "    plot_decision_boundary(lambda x: pred(x, sess), X_te, y_te)\n",
    "\n",
    "# notice: we do not need to use the session environment anymore, so returning from it.\n",
    "plt.title(\"Trained Classifier\")\n",
    "\n",
    "epoch = np.arange(len(train_cost))\n",
    "plt.figure()\n",
    "plt.plot(epoch,train_cost,'r',epoch,val_cost,'b')\n",
    "plt.legend(['Train Loss','Val Loss'])\n",
    "plt.xlabel('Updates'), plt.ylabel('Loss')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Assignments Half Moon\n",
    "\n",
    " 1) A linear logistic classifier is only able to create a linear decision boundary. Change the Logistic classifier into a (non-linear) Neural network by inserting a dense hidden layer between the input and output layers of the model\n",
    " \n",
    " 2) Experiment with multiple hidden layers or more / less hidden units. What happens to the decision boundary?\n",
    " \n",
    " 3) Overfitting: When increasing the number of hidden layers / units the neural network will fit the training data better by creating a highly nonlinear decision boundary. If the model is to complex it will often generalize poorly to new data (validation and test set). Can you obseve this from the training and validation errors? \n",
    " \n",
    " 4) We used the vanilla stocastic gradient descent algorithm for parameter updates. This is usually slow to converge and more sophisticated pseudo-second-order methods usually works better. Try changing the optimizer to [adam or momentum](https://www.tensorflow.org/versions/r0.10/api_docs/python/train.html#AdamOptimizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Optional:  MNIST dataset\n",
    "MNIST is a dataset that is often used for benchmarking. The MNIST dataset consists of 70,000 images of handwritten digits from 0-9. The dataset is split into a 50,000 images training set, 10,000 images validation set and 10,000 images test set. The images are 28x28 pixels, where each pixel represents a normalised value between 0-255 (0=black and 255=white).\n",
    "\n",
    "### Primer for the afternoon...\n",
    "We use a feedforward neural network to classify the 28x28 mnist images. ``num_features`` is therefore 28x28=784.\n",
    "That is, we represent each image as a vector. The ordering of the pixels in the vector does not matter, so we could permutate all images using the same permutation and still get the same performance. (Your are of course encouraged to try this using ``numpy.random.permutation`` to get a random permutation :)). This task is therefore called the _permutation invariant_ MNIST. Obviously this throws away a lot of structure in the data. After lunch we'll fix this with the convolutional neural network wich encodes prior knowledgde about data that has either spatial or temporal structure.  \n",
    "\n",
    "### Ballpark estimates of hyperparameters\n",
    "__Optimizers:__\n",
    "    1. SGD + Momentum: learning rate 1.0 - 0.1 \n",
    "    2. ADAM: learning rate 3*1e-4 - 1e-5\n",
    "    3. RMSPROP: somewhere between SGD and ADAM\n",
    "\n",
    "__Regularization:__\n",
    "    1. Dropout. Dropout rate 0.1-0.5\n",
    "    2. L2 and L1 regularization - https://www.tensorflow.org/versions/r0.10/api_docs/python/contrib.layers.html#regularizers.\n",
    "    Not used that often in deep learning, but 1e-4  -  1e-8.\n",
    "    3. Batchnorm: Batchnorm also acts as a regularizer - https://www.tensorflow.org/versions/r0.10/api_docs/python/contrib.layers.html#batch_norm\n",
    "    Often very useful (faster and better convergence)\n",
    "    \n",
    "    \n",
    "__Parameter initialization__\n",
    "    Parameter initialization is extremely important. TensorFlow has a lot of different initializers, check the TensorFlow API [documentation](https://www.tensorflow.org/versions/r0.10/api_docs/index.html). Often used initializer are\n",
    "    1. He - (not available in TensorFlow's API)\n",
    "    2. Glorot - https://www.tensorflow.org/versions/r0.10/api_docs/python/contrib.layers.html#xavier_initializer\n",
    "    3. Uniform or Normal with small scale. (0.1 - 0.01) - https://www.tensorflow.org/versions/r0.10/api_docs/python/state_ops.html#random_normal_initializer\n",
    "    4. Orthogonal (I find that this works very well for RNNs) - (not available in TensorFlow's API)\n",
    "\n",
    "Bias is nearly always initialized to zero using the [tf.constant_initializer](https://www.tensorflow.org/versions/r0.10/api_docs/python/state_ops.html#constant_initializer).\n",
    "\n",
    "__Number of hidden units and network structure__\n",
    "   Probably as big network as possible and then apply regularization. You'll have to experiment :). One rarely goes below 512 units for feedforward networks unless your are training on CPU...\n",
    "   Theres is some research into stochstic depth networks: https://arxiv.org/pdf/1603.09382v2.pdf, but in general this is trail and error.\n",
    "\n",
    "__Nonlinearity__: [The most commonly used nonliearities are](https://www.tensorflow.org/versions/r0.10/api_docs/python/nn.html#activation-functions)\n",
    "    \n",
    "    1. ReLU\n",
    "    2. Leaky ReLU. Same as \n",
    "    3. Elu\n",
    "    3. Sigmoids are used if your output is binary. It is not used in the hidden layers. Squases the output between -1 and 1\n",
    "    4. Softmax used as output if you have a classification problem. Normalizes the the output to 1. )\n",
    "\n",
    "\n",
    "See the plot below.\n",
    "\n",
    "__mini-batch size__\n",
    "   Usually people use 16-256. Bigger is not allways better. With smaller mini-batch size you get more updates and your model might converge faster. Also small batchsizes uses less memory  -> you can train a model with more parameters.\n",
    "\n",
    "Hyperparameters can be found by experience (guessing) or some search procedure. Random search is easy to implement and performs decent: http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf . \n",
    "More advanced search procedures include [SPEARMINT](https://github.com/JasperSnoek/spearmint) and many others."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# PLOT OF DIFFERENT OUTPUT USNITS\n",
    "x = np.linspace(-6, 6, 100)\n",
    "relu = lambda x: np.maximum(0, x)\n",
    "leaky_relu = lambda x: np.maximum(0, x) + 0.1*np.minimum(0, x) # probably a slow implementation....\n",
    "elu = lambda x: (x > 0)*x + (1 - (x > 0))*(np.exp(x) - 1) \n",
    "sigmoid = lambda x: (1+np.exp(-x))**(-1)\n",
    "def softmax(w, t = 1.0):\n",
    "    e = np.exp(w)\n",
    "    dist = e / np.sum(e)\n",
    "    return dist\n",
    "x_softmax = softmax(x)\n",
    "\n",
    "plt.figure(figsize=(6,6))\n",
    "plt.plot(x, relu(x), label='ReLU', lw=2)\n",
    "plt.plot(x, leaky_relu(x), label='Leaky ReLU',lw=2)\n",
    "plt.plot(x, elu(x), label='Elu', lw=2)\n",
    "plt.plot(x, sigmoid(x), label='Sigmoid',lw=2)\n",
    "plt.legend(loc=2, fontsize=16)\n",
    "plt.title('Non-linearities', fontsize=20)\n",
    "plt.ylim([-2, 5])\n",
    "plt.xlim([-6, 6])\n",
    "\n",
    "# softmax\n",
    "# assert that all class probablities sum to one\n",
    "print np.sum(x_softmax)\n",
    "assert abs(1.0 - x_softmax.sum()) < 1e-8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## MNIST\n",
    "First let's load the MNIST dataset and plot a few examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#To speed up training we'll only work on a subset of the data\n",
    "data = np.load('mnist.npz')\n",
    "num_classes = 10\n",
    "x_train = data['X_train'][:1000].astype('float32')\n",
    "targets_train = data['y_train'][:1000].astype('int32')\n",
    "\n",
    "x_valid = data['X_valid'][:500].astype('float32')\n",
    "targets_valid = data['y_valid'][:500].astype('int32')\n",
    "\n",
    "x_test = data['X_test'][:500].astype('float32')\n",
    "targets_test = data['y_test'][:500].astype('int32')\n",
    "\n",
    "print \"Information on dataset\"\n",
    "print \"x_train\", x_train.shape\n",
    "print \"targets_train\", targets_train.shape\n",
    "print \"x_valid\", x_valid.shape\n",
    "print \"targets_valid\", targets_valid.shape\n",
    "print \"x_test\", x_test.shape\n",
    "print \"targets_test\", targets_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#plot a few MNIST examples\n",
    "idx = 0\n",
    "canvas = np.zeros((28*10, 10*28))\n",
    "for i in range(10):\n",
    "    for j in range(10):\n",
    "        canvas[i*28:(i+1)*28, j*28:(j+1)*28] = x_train[idx].reshape((28, 28))\n",
    "        idx += 1\n",
    "plt.figure(figsize=(7, 7))\n",
    "plt.axis('off')\n",
    "plt.imshow(canvas, cmap='gray')\n",
    "plt.title('MNIST handwritten digits')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Hyperparameters\n",
    "\n",
    "num_classes = 10\n",
    "num_l1 = 512\n",
    "num_features = x_train.shape[1]\n",
    "\n",
    "# resetting the graph ...\n",
    "reset_default_graph()\n",
    "\n",
    "# Setting up placeholder, this is where your data enters the graph!\n",
    "x_pl = tf.placeholder(tf.float32, [None, num_features])\n",
    "\n",
    "# defining our weight initializers\n",
    "weight_initializer = tf.truncated_normal_initializer(stddev=0.1)\n",
    "\n",
    "# Setting up the trainable weights of the network\n",
    "with tf.variable_scope('l_1'):\n",
    "    W_1 = tf.get_variable('W', [num_features, num_l1],\n",
    "                          initializer=weight_initializer)\n",
    "    b_1 = tf.get_variable('b', [num_l1],\n",
    "                          initializer=tf.constant_initializer(0.0))\n",
    "\n",
    "with tf.variable_scope('l_2'):\n",
    "    W_2 = tf.get_variable('W', [num_l1, num_classes],\n",
    "                          initializer=weight_initializer)\n",
    "    b_2 = tf.get_variable('b', [num_classes],\n",
    "                          initializer=tf.constant_initializer(0.0))\n",
    "\n",
    "\n",
    "# Building the layers of the neural network\n",
    "l1 = tf.matmul(x_pl, W_1) + b_1\n",
    "l1_nonlinear = tf.nn.elu(l1) # you can try with various activation functions!\n",
    "l2 = tf.matmul(l1, W_2) + b_2\n",
    "y = tf.nn.softmax(l2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# y_ is a placeholder variable taking on the value of the target batch.\n",
    "y_ = tf.placeholder(tf.float32, [None, num_classes])\n",
    "\n",
    "# computing cross entropy per sample\n",
    "cross_entropy = -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1])\n",
    "\n",
    "# averaging over samples\n",
    "loss_tn = tf.reduce_mean(cross_entropy)\n",
    "\n",
    "# L2 regularization\n",
    "#reg_scale = 0.0001\n",
    "#regularize = tf.contrib.layers.l2_regularizer(reg_scale)\n",
    "#params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)\n",
    "#reg_term = sum([regularize(param) for param in params])\n",
    "#loss_tn += reg_term\n",
    "\n",
    "# defining our optimizer\n",
    "optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)\n",
    "\n",
    "# applying the gradients\n",
    "train_op = optimizer.minimize(loss_tn)\n",
    "\n",
    "# notice, alternatively you can use train_op = optimizer.minimize(crossentropy)\n",
    "# instead of the three steps above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Test the forward pass\n",
    "x = np.random.normal(0,1, (45, 28*28)).astype('float32') #dummy data\n",
    "\n",
    "# restricting memory usage, TensorFlow is greedy and will use all memory otherwise\n",
    "gpu_opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.2)\n",
    "# initialize the Session\n",
    "sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_opts))\n",
    "sess.run(tf.initialize_all_variables())\n",
    "res = sess.run(fetches=[y], feed_dict={x_pl: x})\n",
    "print \"y\", res[0].shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Build the training loop.\n",
    "We train the network by calculating the gradient w.r.t the cost function and update the parameters in direction of the negative gradient. \n",
    "\n",
    "\n",
    "When training neural network you always use mini batches. Instead of calculating the average gradient using the entire dataset you approximate the gradient using a mini-batch of typically 16 to 256 samples. The paramters are updated after each mini batch. Networks converges much faster using minibatches because the paramters are updated more often.\n",
    "\n",
    "We build a loop that iterates over the training data. Remember that the parameters are updated each time ``f_train`` is called."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# using confusionmatrix to handle \n",
    "from confusionmatrix import ConfusionMatrix\n",
    "\n",
    "# setting hyperparameters and gettings epoch sizes\n",
    "batch_size = 100\n",
    "num_epochs = 100\n",
    "num_samples_train = x_train.shape[0]\n",
    "num_batches_train = num_samples_train // batch_size\n",
    "num_samples_valid = x_valid.shape[0]\n",
    "num_batches_valid = num_samples_valid // batch_size\n",
    "\n",
    "# setting up lists for handling loss/accuracy\n",
    "train_acc, train_loss = [], []\n",
    "valid_acc, valid_loss = [], []\n",
    "test_acc, test_loss = [], []\n",
    "cur_loss = 0\n",
    "loss = []\n",
    "## TRAINING ##\n",
    "for epoch in range(num_epochs):\n",
    "    #Forward->Backprob->Update params\n",
    "    cur_loss = 0\n",
    "    for i in range(num_batches_train):\n",
    "        idx = range(i*batch_size, (i+1)*batch_size)\n",
    "        x_batch = x_train[idx]\n",
    "        target_batch = targets_train[idx]\n",
    "        feed_dict_train = {x_pl: x_batch, y_: onehot(target_batch, num_classes)}\n",
    "        fetches_train = [train_op, loss_tn]\n",
    "        res = sess.run(fetches=fetches_train, feed_dict=feed_dict_train)\n",
    "        batch_loss = res[1]\n",
    "        cur_loss += batch_loss\n",
    "    loss += [cur_loss/batch_size]\n",
    "    \n",
    "    confusion_valid = ConfusionMatrix(num_classes)\n",
    "    confusion_train = ConfusionMatrix(num_classes)\n",
    "\n",
    "    ### EVAL - TRAIN ###\n",
    "    for i in range(num_batches_train):\n",
    "        idx = range(i*batch_size, (i+1)*batch_size)\n",
    "        x_batch = x_train[idx]\n",
    "        targets_batch = targets_train[idx]\n",
    "        # what to feed our accuracy op\n",
    "        feed_dict_eval_train = {x_pl: x_batch, y_: onehot(targets_batch, num_classes)}\n",
    "        # deciding which parts to fetch\n",
    "        fetches_eval_train = [y]\n",
    "        # running the validation\n",
    "        res = sess.run(fetches=fetches_eval_train, feed_dict=feed_dict_eval_train)\n",
    "        # collecting and storing predictions\n",
    "        net_out = res[0]\n",
    "        preds = np.argmax(net_out, axis=-1)\n",
    "        confusion_train.batch_add(targets_batch, preds)\n",
    "\n",
    "    ### EVAL - VALIDATION ###\n",
    "    confusion_valid = ConfusionMatrix(num_classes)\n",
    "    for i in range(num_batches_valid):\n",
    "        idx = range(i*batch_size, (i+1)*batch_size)\n",
    "        x_batch = x_valid[idx]\n",
    "        targets_batch = targets_valid[idx]\n",
    "        # what to feed our accuracy op\n",
    "        feed_dict_eval_train = {x_pl: x_batch, y_: onehot(targets_batch, num_classes)}\n",
    "        # deciding which parts to fetch\n",
    "        fetches_eval_train = [y]\n",
    "        # running the validation\n",
    "        res = sess.run(fetches=fetches_eval_train, feed_dict=feed_dict_eval_train)\n",
    "        # collecting and storing predictions\n",
    "        net_out = res[0]\n",
    "        preds = np.argmax(net_out, axis=-1) \n",
    "        confusion_valid.batch_add(targets_batch, preds)\n",
    "    \n",
    "    train_acc_cur = confusion_train.accuracy()\n",
    "    valid_acc_cur = confusion_valid.accuracy()\n",
    "\n",
    "    train_acc += [train_acc_cur]\n",
    "    valid_acc += [valid_acc_cur]\n",
    "    print \"Epoch %i : Train Loss %e , Train acc %f,  Valid acc %f \" \\\n",
    "    % (epoch+1, loss[-1], train_acc_cur, valid_acc_cur)\n",
    "    \n",
    "    \n",
    "epoch = np.arange(len(train_acc))\n",
    "plt.figure()\n",
    "plt.plot(epoch,train_acc,'r',epoch,valid_acc,'b')\n",
    "plt.legend(['Train Acc','Val Acc'])\n",
    "plt.xlabel('Updates'), plt.ylabel('Acc')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# More questions\n",
    "\n",
    "1. Do you see overfitting? Google overfitting if you don't know how to spot it\n",
    "2. Try and regularize your network by penalizing the L2 or L1 norm of the network parameters. [Read the docs for more info](https://www.tensorflow.org/versions/r0.10/api_docs/python/contrib.layers.html#regularizers)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
